{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Make weighted manual subset\n",
    "\n",
    "To reduce the amount of manual coding labor, we didn't create the weighted subset through a completely separate sampling operation. Instead, Underwood calculated how a weighted sample would be distributed across the range of reprinting-frequencies, [and randomly sampled a *supplement*](https://github.com/tedunderwood/noveltmmeta/blob/master/manuallists/make_weighted_supplement.ipynb) to make up the difference between a sample representing distinct titles and a sample representing volumes.\n",
    "\n",
    "Now we have to load that supplement and merge it with the title sample, after randomly removing volumes from the title sample. All the volumes we remove will be ones that appear only once in Hathi; the volumes we add will all have more than one appearance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from matplotlib import pyplot as plt\n",
    "from collections import Counter\n",
    "from difflib import SequenceMatcher\n",
    "from sklearn.metrics import cohen_kappa_score\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the data, standardize data dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "j = pd.read_csv('weighteddata/jessicamanycopies.csv', index_col = 'docid')\n",
    "p = pd.read_csv('weighteddata/patrickmanycopies.csv', index_col = 'docid')\n",
    "t = pd.read_csv('weighteddata/tedmanycopies.csv', index_col = 'docid')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "j['gender'] = j['gender'].fillna(value = 'u')\n",
    "p['gender'] = p['gender'].fillna(value = 'u')\n",
    "t['gender'] = t['gender'].fillna(value = 'u')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def dominant_category(astring):\n",
    "    ''' Accepts a category string that may contain multiple\n",
    "    values, and reduces it to a single category.\n",
    "    '''\n",
    "    astring = astring.replace(' ', '')\n",
    "    cats = astring.split('|')\n",
    "    if 'nonfic' in cats:\n",
    "        return 'nonfic'\n",
    "    if 'poetry' in cats:\n",
    "        return 'poetry'\n",
    "    if 'drama' in cats:\n",
    "        return 'drama'\n",
    "    if 'reprint' in cats:\n",
    "        return 'reprint'\n",
    "    if 'juvenile' in cats:\n",
    "        return 'juvenile'\n",
    "    if 'shortstories' in cats:\n",
    "        return 'shortstories'\n",
    "    if 'novel' in cats:\n",
    "        return 'novel'\n",
    "    else:\n",
    "        return 'error'\n",
    "\n",
    "j = j.assign(category = j.category.map(dominant_category))\n",
    "p = p.assign(category = p.category.map(dominant_category))\n",
    "t = t.assign(category = t.category.map(dominant_category))"
   ]
  },
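  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cascade of `if` statements above encodes a priority order: when a reader entered several categories, the one tested earliest wins. As a minimal sketch (not the code used to produce the dataset), the same rule can be written as a loop over an ordered list. The names `CATEGORY_PRIORITY` and `dominant_category_sketch` are hypothetical, introduced here only for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only; these names are not used elsewhere in the notebook.\n",
    "CATEGORY_PRIORITY = ['nonfic', 'poetry', 'drama', 'reprint',\n",
    "                     'juvenile', 'shortstories', 'novel']\n",
    "\n",
    "def dominant_category_sketch(astring):\n",
    "    ''' Returns the highest-priority category found in a\n",
    "    pipe-delimited category string.\n",
    "    '''\n",
    "    cats = astring.replace(' ', '').split('|')\n",
    "    for candidate in CATEGORY_PRIORITY:\n",
    "        if candidate in cats:\n",
    "            return candidate\n",
    "    return 'error'\n",
    "\n",
    "# 'reprint' outranks 'novel'; unknown labels fall through to 'error'\n",
    "assert dominant_category_sketch('novel | reprint') == 'reprint'\n",
    "assert dominant_category_sketch('novel|shortstories') == 'shortstories'\n",
    "assert dominant_category_sketch('memoir') == 'error'"
   ]
  },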
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How much do human readers agree?\n",
    "\n",
    "The three readers in this portion of the project worked on overlapping sets of books, allowing us to roughly estimate average levels of agreement.\n",
    "\n",
    "Note that \"agree\" may be the wrong verb here, if it suggests as the other alternative a settled and irreconcilable *dis*agreement. Many cases of \"disagreement\" below are just data entry errors. In other cases, we probably could have come to consensus given a bit more time to discuss the meanings of categories. But in this project, we weren't working to stabilize a local consensus for the purposes of a particular experiment. We were rather producing a dataset that we expect to be borrowed by other people, who may or may not share our consensus. So it seemed appropriate to report a first (low) estimate of agreement; that will be more likely to reflect the intersubjective conditions operative for users of this project who know nothing but the names of the categories we used."
   ]
  },
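  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def measure_agreement(category):\n",
    "    ''' Compares every pair of readers on the volumes they both coded,\n",
    "    for the column named by category. Returns a Counter of agreements\n",
    "    and disagreements, a list of (value1, value2, docid) tuples for the\n",
    "    disagreements, and the Cohen kappa score pooled across all pairs.\n",
    "    '''\n",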
    "    agreement = Counter()\n",
    "    differences = []\n",
    "    \n",
    "    firstlist = []\n",
    "    secondlist = []\n",
    "    \n",
    "    frames = [j, p, t]\n",
    "    for index1 in range(0, 3):\n",
    "        for index2 in range(index1 + 1, 3):\n",
    "            print(index1, ' - ', index2)\n",
    "            frame1 = frames[index1]\n",
    "            frame2 = frames[index2]\n",
    "            overlap = set(frame1.index).intersection(set(frame2.index))\n",
    "            for docid in overlap:\n",
    "                value1 = frame1.loc[docid, category]\n",
    "                value2 = frame2.loc[docid, category]\n",
    "                firstlist.append(value1)\n",
    "                secondlist.append(value2)\n",
    "                \n",
    "                if value1 == value2:\n",
    "                    agreement[True] += 1\n",
    "                elif pd.isnull(value1) and pd.isnull(value2):\n",
    "                    agreement[True] += 1\n",
    "                else:\n",
    "                    agreement[False] += 1\n",
    "                    differences.append((value1, value2, docid))\n",
    "    \n",
    "    k = cohen_kappa_score(firstlist, secondlist)\n",
    "\n",
    "    return agreement, differences, k"
   ]
  },
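  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`cohen_kappa_score` discounts raw agreement by the agreement two readers would reach by chance, given how often each of them uses each label: $\\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is observed agreement and $p_e$ is expected chance agreement. As a rough sketch, the cell below computes that quantity by hand and checks it against scikit-learn. The two label lists are invented for illustration and are not drawn from our data; `kappa_by_hand` is a hypothetical helper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "from sklearn.metrics import cohen_kappa_score\n",
    "\n",
    "# Invented labels from two hypothetical readers (not project data)\n",
    "reader_a = ['m', 'm', 'f', 'f', 'u', 'm', 'f', 'm']\n",
    "reader_b = ['m', 'f', 'f', 'f', 'u', 'm', 'm', 'm']\n",
    "\n",
    "def kappa_by_hand(first, second):\n",
    "    n = len(first)\n",
    "    # observed agreement: share of items given the same label\n",
    "    p_o = sum(a == b for a, b in zip(first, second)) / n\n",
    "    # chance agreement: product of the two readers' label frequencies\n",
    "    counts_a, counts_b = Counter(first), Counter(second)\n",
    "    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2\n",
    "    return (p_o - p_e) / (1 - p_e)\n",
    "\n",
    "manual = kappa_by_hand(reader_a, reader_b)\n",
    "assert abs(manual - cohen_kappa_score(reader_a, reader_b)) < 1e-9\n",
    "manual"
   ]
  },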
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0  -  1\n",
      "0  -  2\n",
      "1  -  2\n"
     ]
    }
   ],
   "source": [
    "genagreement, genderdiffs, k = measure_agreement('gender')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9506172839506173"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "genagreement[True] / (genagreement[False] + genagreement[True])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('m', 'u', 'uc2.ark+=13960=t08w3hh7p'),\n",
       " ('m', 'f', 'mdp.39015039779825'),\n",
       " ('u', 'm', 'uc1.b3295312'),\n",
       " ('m', 'f', 'uc2.ark+=13960=t3xs5q555'),\n",
       " ('u', 'm', 'mdp.39015000017445'),\n",
       " ('m', 'u', 'mdp.39015061860105'),\n",
       " ('f', 'm', 'mdp.39015016445929'),\n",
       " ('m', 'f', 'nyp.33433074865746')]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "genderdiffs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.89534883720930236"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "k"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0  -  1\n",
      "0  -  2\n",
      "1  -  2\n"
     ]
    }
   ],
   "source": [
    "catagreement, catdiffs, kappa = measure_agreement('category')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8827160493827161"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "catagreement[True] / (catagreement[False] + catagreement[True])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('shortstories', 'nonfic', 'uc2.ark+=13960=t08w3hh7p'),\n",
       " ('reprint', 'nonfic', 'njp.32101054938426'),\n",
       " ('nonfic', 'novel', 'nyp.33433075764427'),\n",
       " ('reprint', 'novel', 'mdp.39076006602267'),\n",
       " ('novel', 'shortstories', 'uc1.b3295312'),\n",
       " ('novel', 'shortstories', 'nyp.33433074866611'),\n",
       " ('novel', 'shortstories', 'mdp.39015008170220'),\n",
       " ('shortstories', 'reprint', 'uc2.ark+=13960=t81j9qk9w'),\n",
       " ('nonfic', 'shortstories', 'mdp.39015073934724'),\n",
       " ('nonfic', 'novel', 'nyp.33433075764427'),\n",
       " ('novel', 'shortstories', 'mdp.39015066681480'),\n",
       " ('shortstories', 'novel', 'hvd.32044090343690'),\n",
       " ('juvenile', 'novel', 'uiuo.ark+=13960=t1pg2cj2j'),\n",
       " ('novel', 'nonfic', 'uiuo.ark+=13960=t3kw5sm3s'),\n",
       " ('shortstories', 'novel', 'uc1.$b795540'),\n",
       " ('reprint', 'novel', 'uc2.ark+=13960=t6736q45t'),\n",
       " ('novel', 'nonfic', 'uc2.ark+=13960=t7sn04f3b'),\n",
       " ('shortstories', 'novel', 'wu.89087923348'),\n",
       " ('shortstories', 'novel', 'hvd.32044090343690')]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "catdiffs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5904736562001065"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kappa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0  -  1\n",
      "0  -  2\n",
      "1  -  2\n"
     ]
    }
   ],
   "source": [
    "nationagreement, nationdiffs, kappa = measure_agreement('nationality')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8641975308641975"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nationagreement[True] / (nationagreement[False] + nationagreement[True])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(nan, 'uk', 'uiuo.ark+=13960=t2t44bz9q'),\n",
       " (nan, 'ir', 'uc1.$b686387'),\n",
       " (nan, 'uk', 'nyp.33433075764427'),\n",
       " ('us', 'uk', 'hvd.32044090343690'),\n",
       " (nan, 'us', 'uc1.b3295312'),\n",
       " ('uk', 'us', 'mdp.39015008540703'),\n",
       " ('ir', 'uk', 'uiuo.ark+=13960=t07w6q90x'),\n",
       " ('ir', 'uk', 'mdp.39015059404106'),\n",
       " ('pr', nan, 'mdp.39015073934724'),\n",
       " (nan, 'us', 'mdp.39015000017445'),\n",
       " ('us', 'uk', 'mdp.39015063917739'),\n",
       " ('uk', nan, 'uiuo.ark+=13960=t8sb4r697'),\n",
       " (nan, 'uk', 'nyp.33433075764427'),\n",
       " ('us', 'uk', 'hvd.32044090343690'),\n",
       " ('ukr', 'ru', 'nyp.33433073355830'),\n",
       " ('in', nan, 'mdp.39015061860105'),\n",
       " ('uk', nan, 'uiuo.ark+=13960=t50g48273'),\n",
       " (nan, 'ca', 'uc1.$b323008'),\n",
       " ('ru', 'uk', 'uc1.$b795540'),\n",
       " ('uk', nan, 'uiuo.ark+=13960=t50g48273'),\n",
       " ('uk', 'us', 'nyp.33433074812243'),\n",
       " (nan, 'ca', 'uc1.$b323008')]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nationdiffs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.81097852028639617"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kappa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### what fraction of the errors are caused by blanks?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "13"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ctr = 0\n",
    "for d1, d2, doc in nationdiffs:\n",
    "    if pd.isnull(d1) or pd.isnull(d2):\n",
    "        ctr += 1\n",
    "ctr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5909090909090909"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ctr / len(nationdiffs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Combining the dataframes\n",
    "\n",
    "This is a two-stage process, because we have to first reconcile the disagreements detailed above.\n",
    "\n",
    "#### make list of contested volumes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(72, 18)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat = 0\n",
    "reasons = dict()\n",
    "allcontested = set()\n",
    "\n",
    "for diffs in [genderdiffs, catdiffs, nationdiffs]:\n",
    "    cat += 1\n",
    "    for a, b, docid in diffs:\n",
    "        if docid not in reasons:\n",
    "            reasons[docid] = \"\"\n",
    "        if cat == 1:\n",
    "            reasons[docid] += \"gender|\"\n",
    "        elif cat == 2:\n",
    "            reasons[docid] += \"category|\"\n",
    "        else:\n",
    "            reasons[docid] += \"nation|\"\n",
    "            \n",
    "        allcontested.add(docid)\n",
    "\n",
    "subset = list(allcontested)\n",
    "\n",
    "tsub = t.loc[list(set(subset).intersection(set(t.index))), : ]\n",
    "psub = p.loc[list(set(subset).intersection(set(p.index))), : ]\n",
    "jsub = j.loc[list(set(subset).intersection(set(j.index))), : ]\n",
    "tsub = tsub.assign(reader = 't')\n",
    "psub = psub.assign(reader = 'p')\n",
    "jsub = jsub.assign(reader = 'j')\n",
    "subsets = pd.concat([tsub, psub, jsub])\n",
    "subsets.shape\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(34, 18)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# drop duplicate rows\n",
    "\n",
    "subsets = subsets[~subsets.index.duplicated(keep='first')]\n",
    "subsets.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "subsets['debate'] = subsets.index.to_series().map(reasons)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### resolve disagreements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# subsets.to_csv('weighteddata/contested_volumes.csv', index_label = 'docid')\n",
    "\n",
    "# Note that I'm commenting that out, because I don't want it to re-run \n",
    "# every time this notebook is run. But on the first pass, this is where I\n",
    "# wrote to file, and manually edited, a list of contested volumes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### make list of uncontested volumes, and combine with contested"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(533, 17)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "allvols = pd.concat([j, p, t])\n",
    "allvols.drop_duplicates(inplace = True, keep = False)\n",
    "allvols.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(34, 17)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "contested = pd.read_csv('weighteddata/contested_volumes.csv', index_col = 'docid')\n",
    "contested.drop(labels = ['debate', 'reader'], axis = 1, inplace = True)\n",
    "contested.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(567, 17)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "allvols = pd.concat([allvols, contested])\n",
    "allvols.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### reconsider ```reprint``` category"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(15, 17)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reprints = allvols.loc[allvols.category == 'reprint', : ]\n",
    "reprints.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# commenting out because I don't actually want this step to re-run\n",
    "\n",
    "# reprints.to_csv('weighteddata/reprints.csv', index_label = 'docid')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "reprints:  (15, 18)\n",
      "noreprints:  (552, 18)\n"
     ]
    }
   ],
   "source": [
    "reprints = pd.read_csv('weighteddata/reprints.csv', index_col = 'docid')\n",
    "noreprints = allvols.loc[allvols.category != 'reprint', : ]\n",
    "noreprints = noreprints.assign(hathiadvent = 'contemporary')\n",
    "print(\"reprints: \", reprints.shape)\n",
    "print(\"noreprints: \", noreprints.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(567, 18)"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "allvols = pd.concat([noreprints, reprints], sort = False)\n",
    "allvols.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Merging the frequently-reprinted supplement with the title_subset\n",
    "\n",
    "To do this, we first have to remove an equal number of rarely-reprinted volumes, distributed similarly across the timeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "title = pd.read_csv('manual_title_subset.tsv', sep = '\\t', index_col = 'docid')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "toremove = []\n",
    "\n",
    "for floor in range(1800, 2010, 10):\n",
    "    originals = title.loc[(title.inferreddate >= floor) & \n",
    "                          (title.inferreddate < (floor + 10)) &\n",
    "                         (title.copiesin25yrs.astype(int) < 2), : ]\n",
    "    new = allvols.loc[(allvols.inferreddate >= floor) & \n",
    "                          (allvols.inferreddate < (floor + 10)), : ]\n",
    "    k = new.shape[0]\n",
    "    sample = random.sample(originals.index.tolist(), k)\n",
    "    toremove.extend(sample)  "
   ]
  },
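  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loop above walks through the timeline one decade at a time and, for each decade, removes as many single-copy titles as the supplement adds. Because `random.sample` is unseeded there, each run removes a different set of volumes. The sketch below repeats the per-decade matching on toy data, with two changes that are our assumptions rather than part of the original procedure: a seeded generator, and a guard that takes fewer volumes when a decade has too few candidates. The toy frames and the names `decade_rows` and `sample_by_decade` are invented for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import pandas as pd\n",
    "\n",
    "def decade_rows(frame, floor):\n",
    "    ''' Rows whose inferreddate falls in the decade starting at floor.\n",
    "    '''\n",
    "    return frame.loc[(frame.inferreddate >= floor) &\n",
    "                     (frame.inferreddate < floor + 10), : ]\n",
    "\n",
    "def sample_by_decade(candidates, additions, seed = 0):\n",
    "    ''' For each decade, picks as many candidate docids as there\n",
    "    are additions in that decade.\n",
    "    '''\n",
    "    rng = random.Random(seed)   # seeded, so reruns pick the same docids\n",
    "    chosen = []\n",
    "    for floor in range(1800, 2010, 10):\n",
    "        pool = decade_rows(candidates, floor).index.tolist()\n",
    "        wanted = decade_rows(additions, floor).shape[0]\n",
    "        # guard: never ask for more docids than the decade contains\n",
    "        chosen.extend(rng.sample(pool, min(wanted, len(pool))))\n",
    "    return chosen\n",
    "\n",
    "# Toy data: three candidates in the 1800s, two in the 1810s\n",
    "toy_title = pd.DataFrame({'inferreddate': [1801, 1805, 1809, 1812, 1815]},\n",
    "                         index = ['a', 'b', 'c', 'd', 'e'])\n",
    "toy_new = pd.DataFrame({'inferreddate': [1803, 1807, 1811]},\n",
    "                       index = ['x', 'y', 'z'])\n",
    "\n",
    "removed = sample_by_decade(toy_title, toy_new)\n",
    "assert len(removed) == 3\n",
    "assert sum(docid in {'a', 'b', 'c'} for docid in removed) == 2"
   ]
  },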
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "567"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(toremove)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2730, 18)"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "title.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2163, 18)"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "title.drop(labels = toremove, inplace = True)\n",
    "title.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2730, 18)"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all = pd.concat([title, allvols], sort = False)\n",
    "all.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Redefining categories to avoid misunderstanding\n",
    "\n",
    "In manual coding we used the terms \"novel\" and \"shortstories.\" But these phrases are in reality often misleading. Folktales or anecdotes are not really short stories, and some older or experimental fiction might not be quite \"a novel.\"\n",
    "\n",
    "Let's use looser terms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "def remap_categories(cat):\n",
    "    accepted = {'longfiction', 'shortfiction', 'notfiction', 'poetry', 'juvenile', 'drama'}\n",
    "    \n",
    "    if cat == 'novel':\n",
    "        return 'longfiction'\n",
    "    elif cat == 'shortstories':\n",
    "        return 'shortfiction'\n",
    "    elif cat == 'nonfic':\n",
    "        return 'notfiction'\n",
    "    elif cat in accepted:\n",
    "        return cat\n",
    "    else:\n",
    "        print(cat)\n",
    "        return cat\n",
    "\n",
    "all = all.assign(category = all.category.map(remap_categories))"
   ]
  },
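  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, the same renaming can be sketched as a lookup table applied with `Series.replace`. This is only an illustration on a toy series: unlike `remap_categories`, it does not print labels that fall outside the accepted set, it simply leaves them unchanged. The name `LOOSER_TERMS` is hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical lookup table; labels missing from it are left unchanged\n",
    "LOOSER_TERMS = {'novel': 'longfiction',\n",
    "                'shortstories': 'shortfiction',\n",
    "                'nonfic': 'notfiction'}\n",
    "\n",
    "toy = pd.Series(['novel', 'shortstories', 'nonfic', 'poetry'])\n",
    "remapped = toy.replace(LOOSER_TERMS)\n",
    "assert remapped.tolist() == ['longfiction', 'shortfiction', 'notfiction', 'poetry']"
   ]
  },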
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "all.to_csv('weighted_subset.tsv', sep = '\\t', index_label = 'docid')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### EDA on gender"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "def isfiction(astring):\n",
    "    ''' Note that this doesn't count juvenile fiction.\n",
    "    '''\n",
    "    \n",
    "    if pd.isnull(astring):\n",
    "        return 'not'\n",
    "\n",
    "    if 'longfiction' in astring or 'shortfiction' in astring:\n",
    "        return 'fic'\n",
    "    else:\n",
    "        return 'not'\n",
    "\n",
    "all = all.assign(isfic = all.category.map(isfiction))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bootstrap_ratio(numtrue, numfalse):\n",
    "    ''' Resamples a population of numtrue True values and numfalse\n",
    "    False values 1000 times with replacement, and returns the bounds\n",
    "    of a roughly 90% bootstrap interval for the proportion of True.\n",
    "    '''\n",
    "    population = [True] * numtrue + [False] * numfalse\n",
    "    results = []\n",
    "    \n",
    "    for i in range(1000):\n",
    "        sample = np.random.choice(population, size = len(population), replace = True)\n",
    "        ratio = sum(sample) / len(sample)\n",
    "        results.append(ratio)\n",
    "    \n",
    "    results.sort()\n",
    "    return results[49], results[950]"
   ]
  },
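  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`bootstrap_ratio` resamples a population of `True` and `False` values with replacement and reads an interval off the sorted ratios. A vectorized variant of the same idea is sketched below, drawing all replicates at once and using `np.percentile` for the bounds. The counts (30 and 70), the seed, and the name `bootstrap_interval_sketch` are assumptions made for illustration; this is not the function used for the figures in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def bootstrap_interval_sketch(numtrue, numfalse, iterations = 1000, seed = 0):\n",
    "    ''' Percentile bootstrap for the proportion of True values.\n",
    "    '''\n",
    "    rng = np.random.default_rng(seed)\n",
    "    population = np.array([True] * numtrue + [False] * numfalse)\n",
    "    # one row per bootstrap replicate, each the same size as the population\n",
    "    samples = rng.choice(population, size = (iterations, len(population)), replace = True)\n",
    "    ratios = samples.mean(axis = 1)\n",
    "    low, high = np.percentile(ratios, [5, 95])\n",
    "    return low, high\n",
    "\n",
    "# Invented counts: 30 True and 70 False, so the observed ratio is 0.3\n",
    "low, high = bootstrap_interval_sketch(30, 70)\n",
    "assert low < 0.3 < high\n",
    "low, high"
   ]
  },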
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAhgAAAF6CAYAAABbUCHcAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XmcHGW56PHfExJEIutBtmBCDMh2jiKHRRQkiEhwCSjq\nJSZA3A4XMJyreAXvMSYhqHhOFJFFAYGwRFFRWURkkwFUNkX2hEAyDCEBFdlM2JP3/lE1pDOZLTM1\nXdU9v+/nU590V1dXP6m3uuuZt94lUkpIkiQVaUjZAUiSpOZjgiFJkgpngiFJkgpngiFJkgpngiFJ\nkgpngiFJkgpngiGpSxFxfkQ8HRG3dfH6SRHx94hYEhFviYh/RkT04XO+GhFn9z/i1fZ7RETcUvR+\nJfXMBENNLSIejYgXIuL5iHgiv2CuW3ZctSKiNSLeV3YcHUXEXsB+wJYppXd18vpbgC8B26eUtkwp\nLUoprZd6GFwnIvaJiEW161JK30op/UeR8dfufoD2K6kbJhhqdgn4UEppfWAXYFfga2u6k4hYq+jA\nGsDWwKMppZe6eH0U8FRK6R9ruN/Ai77U9EwwNBgEQErpCeBq4F8BImL9iPhRXr2/KCJmtlfv51Xr\nv4+I70bEU8C0fP3nI+LBvEbk/ojYOV+/RURcGhF/i4gFETHl9Q+PmBYRP42IC/L33RcRu+SvXQiM\nBK7MX/tyvv5neY3LMxHREhE71uxv44i4MiKei4jb87hvqXl9+4i4NiL+ERFzI+ITXR6YLO7L823n\nR8Tn8vWfAc4B9szjmtbhffsB1wJb5q+fFxGjImJFRAzJt9koX7843/8v89qj3+Tv+2f+3s3zY3RR\nzf7H58f36Yj4XURsX/Naa0QcFxH35MfnJxGxdjflPyQiTouIZ/Oye1++n49HxJ86/L++FBG/6uQ4\njY2Ie2ueXxcRd9Q8vzkixuePd4iIG/PY7ouIj9Rsd35EnBERv8n//7dExGYRcUr+f30wIt7RoXzW\n+LySKiGl5OLStAvQCrwvf/wW4H5gev78V8CZwDrAJsBtwOfz144AXgWOJkvE3wB8AlgE7JJv89Z8\nnwH8CfgvYC2yv/wfAfbPt5sGvAAckG/7TeDWDjHu2yHuycC6wDDgu8Bfal67BPhxHtMOwGPAzflr\n6+bPD88/6x3A38huY3R2fG4GTss/p33bsTXH4OZuju0+wGM1z0cBy4Eh+fOrgJ8A6+fHZe/O3ldz\njC7MH78NWAq8L3/f/wUeBobWHK/bgM2ADYEHgf/oIsb2cjw239cngWfz960NPAVsV7P9XcDBnexn\nnbwMNwaGAk/m58Lwmtc2zF97GDg+f7wv8Dywbb6f8/NjvHP++TcAC4GJeXnNBH6Xb9uv88rFpeyl\n9ABcXAZyyS9GzwNP549Pyy/MmwIvAW+o2fbQmh/3I8huD9Tu67fAlE4+Y/dOtj0BODd/PA24tua1\nHYBlHWJ8Xzf/hw2BFcB6ZMnOK8A2Na/PZGWC8Ungpg7v/yEwtZP9bpVffNetWfdN4LyaY9CnBAPY\nAngNWL+n99Uco/YE42vAJTWvBfA48N6a4zWh5vVvA2d2EeMRwOMd1t0OTMwfnwnMzB/vBPwDGNbF\nvm4CDgb2AK4hS/Q+AIwF7s632RtY0uF9Pwa+nj8+Hzir5rUvAA/UPP9X4On88R79Oa9cXMpehiI1\nv4NSSjfWroiIUWR/tT/RflckXx6r2WyVhohktRULOtn/KGBERDzdvnuyi+zNNds8WfP4BWCdiBiS\nUlrRcWf5LYZvAh8nq1lJ+bIJWQ3FWmQX3M7iHAW8q0MsawEXsbotyS5mL9SsawP+vZNt19RW+b6f\n78N7t8zjACCllCJrFDqiZpu/1jx+gSyh6criDs/b8s8AuIAsAZgKTAJ+llJ6tYv93ExWI/E40AI8\nQ5ZcvEyWfJDH0fG8aesm9hc7ef6m/P
FICjyvpHozwdBg0Fm3yUVkNRj/klLqqsFhx/WLgDFd7Gth\nSmm7PsbX8XM+BXyErFbjsYjYgOxiFsDfyWoGtiKrLocs8amNpSWldEAvPncJsHFEDE8pLcvXjWT1\nC3JfLMr3vX4nSUZPDTyXkLeTqfEWVk2q1sSIDs9HApcDpJRuj4hXImJvsuM+oZv93AR8hyxhOJns\nVss5ZOfRGTWxv6XD+0YCD/Uh7v6eV1KpbOSpQSml9CRZI8VTImK9yLw1It7bzdt+BHy5poHmmMi6\nat4B/DMivhIR60TEWhGxU0Ts2s2+apOeJ8nac7Rbj+yv4mciYjjwLfKLcv6X6S+B6RHxxrzx4+E1\n7/018LaImBQRQyNiWETsWttIsuYYPA78EfhWRLwhIt4OfJbOazt6q71B7ZNkDWrPjIgN81j2zrf5\nK/AvEbF+F/v4GfChiNg3f9+XyS7it/Yxps0iYkq+r08A25M1NG13EXA68EpK6Y/d7OePwHZkt8Tu\nSCk9SFZjtAcraxVuB17Iz4WhETEW+DBZW5Teaj83+nteSaUywVCz6+6v5cPJGto9SNZG4+fA5l3u\nKKVLgW8AP46I58kaiW6cX/Q/TNZwr5WsEd85ZI0bexPXycDUvBfBl8iq7R8jq0m4n+zCVmsKWbuM\nJ1hZxf9yHuNSsnYBh5L9Nb0k339XvSwmAKPz7X5B1lbjxi627Y3a/9dhZLUt88iSiv/MY3yI7IK7\nMP8/r3LMU0rzyW5XnE5WY/Mh4CMppdc6+YzeuA3YlqxB50zgkJTSMzWvX0RWY9JtYpXfSvozcH9N\nLLeStZN4Kt/mVbLapw/mn3c6cFhK6eE1iL02mezPeSWVKrquHa7ZKGIc8D2yhOTclNK3O9lmLHAK\n2X3tv6eU9i02VEmdiYiTgc1SSp8uO5ZGFBHrkCVAu6SUOmtjI6kPemyDkTc4O51sRL8lwJ0RcXlK\naV7NNhuQ3YP8QEppcURsMlABS4NdRGwHrJ1Sui8idie7rfGZksNqZEcDd5pcSMXqTSPP3YGHU0pt\nABFxCXAQWbVnu08Bv0gpLQZory6UNCDWA34SEVuQ/eX9PymlK0uOqSFFRGv+8OBSA5GaUG8SjBGs\n2u3qcbKko9bbgGERcSNZF6vvp5T601BMUhdSSn8ia1OgfkopjS47BqlZFdVNdSjZPA/vIxvZ7taI\nuDWl9Ej3b5MkSc2oNwnGYrJ+3O22YvV+8o+TTXr0EvBSRNxMNuzwKglGRNjCWZKkJpJS6rR7dG+6\nqd4JbBPZREZrk3V/u6LDNpcDe+X9tNcl6xc+t4tA+r0UtR+XYpdp06aVHoOLZeti+brUr2y702MN\nRkppeUR8gWxQovZuqnMj4sjs5XR2SmleRFwD3Es2F8HZKRuERpIkDUK9aoORUvot2Qh2tevO6vB8\nFjCruNAkSVKjciRPFWbs2LFlh6ABYtk2N8u3eZVZtr0aybOwD4tIRXxeRPR470eSJA2s/Hrc50ae\nkiRJa8QEQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5Ik\nFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4E\nQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5IkFc4EQ5Ik\nFc4EQ5IkFc4EQ5IkFc4EQ5IkFa6hEozW1jYmTZoB7MWkSTNobW0rOyRJktSJSCnV78MiUl8/r7W1\njf33P40FC2YAw4FljBkzjeuum8Lo0aMKjVOSJPUsIkgpRWevNUwNxtSps2uSC4DhLFgwg6lTZ5cY\nlSRJ6kzDJBiLF69gZXLRbjhLlqwoIxxJktSNhkkwRowYAizrsHYZW27ZMP8FSZIGjYa5Os+cOZkx\nY6
axMsnI2mDMnDm5tJgkSVLnGqaRJ2QNPadOnc2cOfMYOXJ7Wlom28BTkqSSdNfIs6ESjJX72YCN\nNnqOBx6ALbYoIDBJkrTGmqIXyaqeZ8IEuOGGsuOQJEmdadAajGD58sSQBk2PJElqBk1Yg4HJhSRJ\nFeZlWpIkFc4EQ5IkFa5XCUZEjIuIeRExPyKO7+T1fSLi2Yi4K1++VnyokiSpUfSYYETEEOB04ABg\nJ2BCRGzfyaY3p5R2yZeTCo6zSynBUUfB88/X6xMlSVJPelODsTvwcEqpLaX0KnAJcFAn23XainSg\nRcA//gEXXVTGp0uSpM70JsEYASyqef54vq6jPSPi7oi4KiJ2LCS6Xjr6aDjjjKw2Q5Ikla+oRp5/\nBkamlHYmu51yWUH77ZV99sm6rba01PNTJUlSV4b2YpvFwMia51vl616XUlpa8/jqiDgzIjZOKT3d\ncWfTp09//fHYsWMZO3bsGoa8uoiVtRj77tvv3UmSpE60tLTQ0su/5nscyTMi1gIeAvYDngDuACak\nlObWbLNZSumv+ePdgZ+llLbuZF+FjeTZcT///Cdsvz3MnQvrr9/vj5AkST3obiTPHmswUkrLI+IL\nwLVkt1TOTSnNjYgjs5fT2cDHI+Io4FXgReB/FRd+76y3HixcCG94Q70/WZIkddSwc5HUM25JkrS6\nppyLRJIkVZcJhiRJKlxvepFIDa221XNLS8vrPZeK6sUkSVpdU7bBeOghuP56OOaYfn+UmoztdySp\nON21wWjKBOPJJ2GHHaC1FTbcsN8fpyZigiFJxRl0jTw33xzGjYMLLig7EkmSBqemTDAgG9nzzDNh\nxYqyI5EkafBp2kaee+2VDbr1u9/B+99fdjSSJJWnjMbuTdkGo91ZZ8Ett8DFF/f7I9UkbIMhabAr\n8ndw0DXybPfSS9kU7m98Y78/Uk3CBEPSYFevBKNpb5EArLNO2RFIkjQ4NW0jT0mSVB4TDEmSVDgT\nDEmSVLhBk2D88pfwwANlRyFJ0uAwaBKMefPglFPKjkKSpMGhqbup1vrb32C77WDhQthoo36HoAZl\nN1VJg129uqkOmhqMTTeFD30Izj+/7EgkSWp+g6YGA+DWW+Gww2D+fBgyaFIr1bIGQ9Jg50ieHRQx\njnpKsMsu8D//4/wkg5UJhqTBzgRjgCxeDFtsYQ3GYGWCIWmwM8GQBkDZCUYZMxpKUi0TjArzItG4\nyk4walUpFkmDhwlGg/Ai0ViqVF5VikXS4GE3VUmS1LAGbYLx0kuOiSFJ0kAZtAnGsGEwYwb86U9l\nRyJJUvMZtAnGWmvBUUfBGWeUHYkkSc1nUDfyfOop2HZbeOQR+Jd/6ds+bKjXWKpUXlWKRdLgYSPP\nOthkExg/Hs47r+xIJElqLoO6BgPgjjvg0EOzWoy+jO7pX6GNpUrlVaVYJA0ejoNRR/feC29/e9/e\n60WisVSpvKoUi6TBwwSjQXiRaCxVKq8qxSJp8LANhiRJalgmGJIkqXAmGJIkqXAmGDXa2rJeJZIk\nqX9MMGo88EA2uqft7iRJ6h8TjBrjxsGzz1qL0YxaW9uYNGkGsBeTJs2gtbWt7JAkqanZTbWDWbOy\ncTEuvLB329vVcFUtLS20tLS8/njs2LEAjB079vXH9dba2sb++5/GggUzgOHAMsaMmcZ1101h9OhR\npcQEnjuSyuE4GCV5+mkYMwbmz4c3v7nn7b1IdK0qx2bSpBnMmfNlsuSi3TImTpzFxRdPKyusyhwf\nSYOL42CUZOON4aMfhXPPLTsSFaWtbQWrJhcAw1myZEUZ4UjSoDC07ACq6Otft6Fns3j5ZXjooSHA\nMjrWYGy5pfm1JA0Uf2E7sfXWMHp02VGoCGuvDV/72mTe+tZpZEkGtLfBmDlzcnmBSVKd1buxu20w\n+sn76F2r0rFpbW1j6tTZzJlzPRMnvp+ZMyeX2sATqnV8JDW3gWrs
biPPAeRFomtVPDZdxXT22bD5\n5jB+fPmxSFLRBqqxu408pR7stBP8539mA6298ELZ0UhSsRYvrn9j914lGBExLiLmRcT8iDi+m+12\ni4hXI+JjxYVYrjvugGXLet5O1fCrX8GiRWv+vve8B+6+G5YuhV12gbvuKj42SSrLK6+0N3avNbCN\n3Xvcc0QMAU4HDgB2AiZExPZdbHcycE3RQZbppJPgxz8uOwr1xlVXZTUQS5f27f0bbAAXXQTTpmWj\nuv7kJ8XGJ0ll2WKLyWy1VX0bu/cmddkdeDil1JZSehW4BDiok+2mAJcCfyswvtIdcwyccYbdVqvu\nj3+EyZPh8sthhx36t68JE+DOO2HPPQsJTZJKd+mlo7j55ilMnDgL2JuJE2cN+GjGPTbyjIhDgANS\nSv+RP58E7J5SOrZmmy2BOSmlfSPifODKlNIvO9lXwzXyXLECttsOLrgA3v3u1V+3oV7X6nVs7r8f\n9tsvG979gAOqEVNvVCkWSYNHo43k+T2gtm1Gpx/WiIYMgaOPzmoxVD1PPw0HHginnNJzciFJqp/e\njOS5GBhZ83yrfF2tXYFLIiKATYADI+LVlNIVHXc2ffr01x+XOQHWmpg8GU48Ef76V9hss7KjUa2N\nNoJf/hJ2223gPyslmDIFJk709omkwal2Qsue9OYWyVrAQ8B+wBPAHcCElNLcLrZvqlsk7a65Jruo\nrL/+quut5u5aFY9Nf2O67DL43/87W772NRjaj8H2q3h8JDWHSy7JbuuPHLn6a5W5RZJSWg58AbgW\neAC4JKU0NyKOjIj/6Owt/Yq2og44YPXkQoPPwQdnXVhvvRXe+15YuLDsiCRpVffck9W2ls2RPPvJ\nv0K7VsVjU1RMK1bA978P3/gGXH017LprebFIUruXX85uGR93HBxxROfb1KsGwwSjn7xIdK3oY5NS\ndkH/1KfgrW+tRkwPPgjbbJNNqlZ2LJL01a/CvHlZ27TooruFCUaD8CLRtaKPzUknwaWXwk03ZYNi\nVSGm/qhSLJIa3x/+AB//eHaLZNNNu96uMm0wtKrly7O/WlVfZ50F558Pv/1t35MLSWpmDz0EP/xh\n98lFPVmDsYYefxze8Q5oa4M3vcm/QrtT1LH5xS/g2GPh5pthzJhqxNSd55+Hww6Db38btl9tUP36\nxiKpHLXdOVtaWl4fkqEKwzN4i6TCPvYx+MAHsq6KXiS6VsSxefRR2H33rJvwO99ZjZh6klI2/ft/\n/Vd2W+fIIzu/F+q5Iw0OVfuum2BU2PXXwxe/CPfeC0OGVOvEqZKiTuIlS2DLLQsIiPp+0efNywbl\nGjECfvSj1astq/ajI2lgVO27Xq8Eox/DBA1e++0Hr74Kv/992ZFkqlwVV4Sikot62377bLyMr389\nq325//5s5FFJGgysweijr3+9jQsumM1jj13PxInvZ+bMyQM6K11vVSlTrlIs7cqK6ZFHsu6sAK2t\nbUydOps5c6p17kgaGAP1u3PZZbDOOjBuXHnxeIukYK2tbey332m0ts4AhgPLGDNm2oBPfdsbVbqo\nVymWdmXH1Nraxv77n8aCBdU7dyQNjIH43VmyJKsZvfLKrJ1aWfHYTbVgU6fOrkkuAIazYMEMpk6d\nXWJUje/FF+HQQ2HRorIjGThTp86uSS7Ac0fSmkoJPve5rKPBmiYX9WQbjD5YvHgFKy8Q7YazZMmK\nMsJpCq+9BhMmwLrrZo0im5XnjppJs7f/qqpzzslm9/7a18qOpHsmGH0wYsQQYBmrXiiWseWWVgj1\nRUpZJv7ii/Czn8GQJj6MnjtqJrWJRET0ehpv9d2CBfD//l82LtCwYWVH0z0TjD6YOXMyt902bbX7\n6PvuO4Vly2B4xz9Q1a3/+i+47z644Ya+zenRSDo7d9Zaaxq77lr/qQ/961NqPM8/D9/5Duy4Y9mR\n9MxGnn3UsSfAiSdO5lvfGsVd
d2WNbsrqWll2I8ZavYnlz3+GSZPglltgk02qEdNA63juHHbYZI44\nYhTnnAMf+Ug5MVXhuKixeQ51rWrHxl4kDaK2oFKCb30rGwv+yiuzIcXLjKdsvY3lxRfhjW+sQ0BU\n9/jccQccckg2eFsZY2VU6bioMXkOda1qx8YEo0F0VlA//SlMmQKzZ8MHP1h+PGWpUiztqhRTx1j+\n+U9Yb71qxCKtKc+hrlXt2JhgNIiuCurWW7M5KG69tb5tMqp0IlcplnZVislY1Ew8h7pWtWNjgtEg\nuiuo5cthrbWqE0+9dRZLSp1P/FUvVT8+ZalSLOpc1Rvleg51rT/H5tZb4be/hRkzqhFPF/sywRgI\nVftSVSmejrE89hgcfHDWW6SsOTmqfHw6s2JFfbrtVum4qGdVLK8qxlQVfT02y5bBzjvDySdnbbTK\njqebfTmSp8rz1FNwwAFZjxEn/OqdRx6B3XbLjp2kwecrX4E99yw2uagnE4w6O/NMuPvusqOor2XL\n4MMfhoMOgi99qexoGsc228AHPgAHHpj1fZc0eFx7bdYb8fvfLzuSvjPBqLNNN80uGr/+ddmRDJzW\n1jYmTZoB7MWECTM48MA2dtwx68KrNfPNb2a1GOPHZ915JTW/Z56Bz34WzjsPNtyw7Gj6zjYY/dSX\ne1m33w4f/SiccAIce2z58RSps9lC1113GnffPYVtty1/ttCyj0+t3sayYgUcdhg89xz86lcDMzxw\nlY6LelbF8qpiTFWxpsfm2Wez2ovDDqtGPL3Yl20wqmKPPeCPf4SzzsoSjOXLy46oOJ3NFvrCCzOY\nMWN2iVE1tiFDsjFV1lsvG1JdUnPbcMOBSy7qyblISrL11vCHP2QDcv3977D55mVHVIwqzhZa271v\nn332Yfr06UB1uvf1xrBh8JOflB2FJPWeCUaJNtwQLrqo7CiK8fe/wzXXVHO20EZKJCSpWXiLRP3y\nyCNw9NGw3XbZhGXTp09mzJhpZEkGtM80O3Pm5PKClCTVnQmG+uTOO+HjH8/6aG+8Mcydm7Up2Wab\nUVx33RQmTpwF7M3EibO47ropjB5dfgPPZrRoUdkRSOqv1tZsnKBmao8H9iLpt4FoOX3yydnsosce\nu+bDaterJfepp2axfeYz8KY3lRtLo+rv8Xn5ZdhhB5g5EyZOLDcW1VcVy6uKMVVFT1NK7Ltv1hX9\ny18euBgGaqh5hwofQAPxpXr0UfjQh7KT7nvfg6Fr0FKmSl/yKsVSRUUcnwcegP32g7PPzn6gyoxF\n9VPF8qpiTFXR3bH5znfg8svhxhvrP3dVEeym2mC23jrrxjp/fnbR+Oc/y4nj6aezC5e/GdW1007Z\noG2f+xz87ndlRyNpTTzwQFZjPXt2YyYXPTHBqKgNNoCrroKRI2GvveDxx+v32a2t2e2ZbbbJZvJz\nBMlq23VX+PnP4dBDs0HcJFXfq6/C4YdnIxy/9a1lRzMwTDAqbNgw+MEP4KijYO21B/7z7rknu0jt\numvWBuS+++D882HddQf+s9U/++yTldUrr5QdiaTeGDoUTjwxGxK8WdkGo5+qdt+xP/FceGE2nsXn\nPw/rr19uLINBlY5PlWJRz6pYXlWMqSqa+djYyHMAVe3EqVI8VYqliqp0fKoUi3pWxfKqYkxV0czH\nxkaeTei117JlTT33HJx+et/eK0lSb5lgNKhzzoEPfxief7532y9aBMcdlzUmuu223r9Pje3qqy1r\nqSqabSCtnphgNKjPfx5Gj856mDz2WNfbPfhgNivfzjtnz//yF7j44mz0TTW/3/wGPvIRewJJZXv2\nWXjHO+DJJ8uOpH5MMBrU0KFw5pkweXI2XPfll7cxadIMYC8mTZpBa2sbAAsXwr/+KyxYkA3oMnJk\nqWGrzk49FbbaKhvW3R4mUn21tq78Xd5llxnsvHNb08yc3Rs28uynKjTe+eEP2zjmmNNYsWIG2S
ym\n2QRjZc8BUoVjU2X1Oj6vvgqHHJJ1N54zp/MBfSyrzg3U8Mr9VcXyqmJMZWptbWP//U9jwYKVv8uj\nR0/jhhuaa24me5EMoCp8qSZNmsGcOV+m4xTpEyfO4uKLp5UVViWOTZXV8/i89BJ88INZFe0pp5Qb\nS6Oq0jGqUiztqhhTmar6u1y07hKMNZjlQlW1ePEKVj2JAYazZMmKMsJRBa2zTjbfwcKFZUciDQ7+\nLtsGoymMGDEEWNZh7TK23NLi1UrrrZfVYEgaeP4um2A0hZkzJzNmzDRWnsxZG4yZMyeXFpMkDWb+\nLptgNIXRo0dx3XVTmDhxFrA3EyfOKr2BpxpDbSv32t5HkvrH32UbefZb1Ro2VSmeKsVSRWUfn9bW\nNt71rtP429+q1fuoqsour1pViqVdFWOqimY+Ng4VLmk1U6fOrkkuAIazYMEMpk6dXWJUUmNburTs\nCKqjVwlGRIyLiHkRMT8iju/k9fERcU9E/CUi7oiI9xQfqqQi2cpdKtYrr8B73wu/+13ZkVRDjwlG\nRAwBTgcOAHYCJkTE9h02uz6l9I6U0juBzwI/KjxSSYWylbuKYDuelWbOhC23hH33LTuSauixDUZE\nvAuYllI6MH9+ApBSSt/uYvs9gR+llHbq5DXbYAywKsVTpViqqOzj09lIg8OGTeP++6fwtreV0waj\nqiNnQvnlVasqsXR2Dg3Wdjy33QYHHwx3381qw4FXpbwGQndtMEgpdbsAhwBn1zyfBHy/k+0OBuYC\nTwF7dLGv1Gyq9n+qUjxViqWKqnB8Fi58NE2cOD3BXmnixOnpgQceLTuk11Xh+NSqQjwdy2vhwnLL\na8KE6QmWJkg1y9I0ceL0UuOqt2XLUtp225R+/vPOX6/CuTNQ8v9bp/lDYSN5ppQuAy6LiL2Ak4D9\nO9tu+vTprz+uwl8lUj3V/oW+zz77vP59KOu7MHr0KC6+eBpz5kzn4otvqfvnq/dWrS2Yxpw5y7jt\ntmlcddUURo0axTrrrP6exYvhvvuy2XRfeilbXnwR3vY2eP/7V9/+hhvgBz9YdduXXoLx42FaJ6Nb\n33WX7XgA/vu/Yffds0kFm13tb1hPenuLZHpKaVz+vNtbJPk2C4DdUkpPd1ifevq8RlO1qq8qxVOl\nWNS9KpZV1WIqO56u5raImMWRR07jBz9Y/T1XX53NqLvOOtnyxjdm/773vTBhwurbL1gAf/nLyu3b\nl802g7e8pfcxNdt8Gz1Ztgxeew022KDz18s+dwZSf+ciuRPYJiJGAU8AhwKrnJoRMSaltCB/vAuw\ndsfkQlLrUYTRAAASY0lEQVRjWrYMhnf8I1V111Wvn7FjV3SaXAAceGC29NaYMdnSWzNnTua226at\n0gZjxIhpzJw5pfc7aQJ+PzrXY4KRUloeEV8AriXrdXJuSmluRByZvZzOBg6JiMOBV4AXgU8OZNBl\nq1o1tzRQ5s7NZmG98UbYeuuyoxncVvb6WbW2oMxeP+2jVU6dOos5c65nv/3ezwknDL4GnuqcI3k2\nmSpVxVUpFnWvu7I6/XT47nfhpps6ryYvI6Yy1DueFStgSE3uUPUeG1Urrypp5mPjSJ6S+uwLX8iW\n/faDJUvKjqb5PfssfPGLcNhhq653bovq+Mc/4IUXyo6i+qzBaDJVypSrFIu615uyOvlkuOACaGnJ\nGv1VIaZ6Guh4li+HH/0o660xfjycdBJsumk5sfRFFWMaCCll5TN2LBx3XO/e08zHpr+NPCWJE06A\nYcPgqafqk2AMJjfdBMceCxtumPX8eOc7y46oGCnB/Pmw3XYD/1n1GqTtvPOy7r9TBlc71j6xBqPJ\nVClTrlIs6l4Vy6pqMQ1kPGeeCW9+czaOQnQ+JmLdYumrzmJatAh22QWuuiobJ6LMWIqwcCHssUdW\ni7fTamNV1z+eKuiuBsMEo8lU6USuUizqXhXLqmoxVSmeKs
XSrquYrrgCjj4abr8dRowoN5b+WL48\nm2PkoIN6f2tkIOOpCm+RSFIFrFiR1VD0ppaiWYwfDw8+mM3TcdNNsO66ZUfUN7/5TVZu/+f/lB1J\n47AXiaR++fnPs8G41L3bboM994Trry87kvo7/visHcZnPpO1y2hEH/lI1j5mrbXKjqRxmGBI6rOU\nsh/d8eOzeSu0usWL4fDD4ZBD4Jhjsu6+g00EnHNONlT5M8+UHU3fNWrtS1lMMCT1WfuFY/PN4aMf\nzSbGUubll+Eb34C3vx222grmzcsSjSGD9Ff3jW+E88+HjTcuOxLVyyA91SUVZa21svEx1l8fPvEJ\neOWVsiOqjiVL4M474ZvfhPXWKzsaqb7sRdJkqtRauUqxqHtFlNWrr2YJxmabwVlnVSOmIlUpnirF\n0q5KMRURy7Jl0NYGO+5YjXiqym6qg0iVTuQqxaLuFVVWL7+c/dU+enR1YipKd/F0nDekzFjK0peY\nUsqmOR82rPxYOjr66Ox8PvfcasRTVc5FIqku3vCGYpKLRvHKK3DKKbDbbtk4CVozp50GRx5ZvZ4l\n11yTDQ72ne+UHUljM8GQpD64+uqsAec118BFF9l9sS8++1m4664sSauKp5/O4jrvvGzodvWdt0ia\nTNlVcfWaD0DFGujzJqU1H1yq7HO5XWtrG1OnzmbOnOuZOPH9fPrTk/nud0fx8MPZhfGDH6z/wFlV\nOTa1+hrTY49l44Occ052LMuMBWDChGySuVNPLSaW/sZTdbbBGESa+UTWwBnI8+bCC+HPf4bvfW/N\nLsRVOJdbW9vYf//TWLBgBjAcWMYWW0xj8uQpTJ8+irXXLieuKhybjvoT0623ZkNwt7SU26jy0Uez\nhspFjzhaxfIqim0wJJVm/Hj4/e/hK1+p3r32nkydOrsmuQAYzhNPzOCxx2aXllw0oz33hFmzYMaM\ncuPYeutszhQH1CqGc5FIGlAbbgjXXZdNFLX22nDSSdWfi2PevGwI9IULV7AyuWg3nCVLVpQRVlM7\n/HD41KfKjmLwDoQ2EDyUkgbcxhtnc3BcdhnMnFl2NJ2bOxdOPBH+7d+y4bz//nfYbLMhQMeJVpax\n5Zb+dA6Eof7J21Rsg9EEbFip/qrXPeInn4TDDoOf/Qw22qgaMUE20uYZZ8DHP57dg3/3u7O/ZDtr\ngzFmzDSuu24Ko0ePqktsnaniPf0qxVSlWKB68RTJRp6SulXFH8B6xrR0aXbfvbPq8Y69SGbOnFxq\ncgGWV096G8vy5XDFFdlU8gN5265Kx6ZoJhiSulXFH8AiY3rggaxNxYIF2ZgVZcfTX1WKpV3RMaUE\nn/scnHACbLvtwMRy8slw7bXZ7buBbHtRxfIqir1IJA06DzwA06fDTjvBuHHw3HNw1FFlR6XeioA9\n9sh6IT33XPH7v+eebKTO2bNt2DlQrMGQVPpfWPfem42KWas/MaUEe++dDeH9yU9mF6r+XkTKPka1\nqhRLu4GKacoUeOQR+PWvez9aak+xvPxydm4cdxwccURBgXYwWNrGeYtEUrfKvGA99VSWXMyatWo3\nxd7ElFJ2H70evQ+qdFGvUiztBiqm116DAw/MzpHezg3SUywnnADz58MvflH9LtNV5y0SSZW1ySbZ\nffAvfSlrJ9GTlOC+++DrX4cddsh6f6h5DR2a9Tq68sqsQWZ/vfJKVmN21lkmFwPNGgxJlfiL+O67\n4YAD4MQT27jlltV7bTz+eHZR+PnP4cUXs+6kn/gE7L57fS4UVThG7aoUS7uBjmnRIthsM3o1gmoV\nj0+z8haJpNVU8R7x5Ze38bGPncaKFauPO7F06SguuKC+SUWtKl20qhRLuyrFVKVYmp0JhqSGMGnS\nDObM+TKrDs+9jIkTZ3HxxdPKCguo1kWrSrG0q1JMVYql2dkGQ1JDWLzYuT+kZmGCIakyRoxw7g/1\nzjPPwAUX9LxdSllD4O
efH/iYtCq/tZIqY+bMyYwZM42VSUbWBmPmzMmlxaRqeu21bHr3H/+4++0u\nuSRLMIYNq09cWsk2GJIqpYpzf0D59/Wr2Ci3VhnH5777splvf/3rrOFvx1gWL4Z3vhOuvhr+/d/r\nGtqgYSNPSQ2n7At6R1WLpwqqkPRccQUcfTTcfjuMGJGtiwhWrEiMGwfveU82ZooGhgmGpIZTtQt6\n1eLRSiefDJdeCjffnM2KGxGccUbiggvgD3+oz0ivg1V3CYaHXZLU0I4/HrbaChYvbmPGjNnAXpxx\nxgxOPXUyQ4eWf3ttsLIGQ1IlVa3GoGrxaFWtrW3sv/9pLFiw+iBtVWjD06wcB0OS1NSmTp1dk1wA\nDGfBghlMnTq7xKgGNxMMSVLDc5C26jHBkCQ1PAdpqx6PvCSp4TlIW/WYYEiSGt7o0aO47ropTJw4\nC9ibiRNn2cCzZPYikVRJVeu1UbV41DXLqn7sRSJJkurKBEOSJBXOBEOSJBWuVwlGRIyLiHkRMT8i\nju/k9U9FxD358vuI+LfiQ5UkSY2ixwQjIoYApwMHADsBEyJi+w6bLQTem1J6B3AScE7RgUqSpMbR\nmxqM3YGHU0ptKaVXgUuAg2o3SCndllJ6Ln96GzCi2DAlSVIj6U2CMQJYVPP8cbpPID4HXN2foCRJ\nUmMrdLr2iNgX+DSwV5H7lSRJjaU3CcZiYGTN863ydauIiLcDZwPjUkrPdLWz6dOnv/547NixjB07\ntpehSlJ9tbS00NLSAsA+++zz+u+Xv10arGq/Ez3pcSTPiFgLeAjYD3gCuAOYkFKaW7PNSOAG4LCU\n0m3d7MuRPCX1iqMxqq88d+qnu5E8e6zBSCktj4gvANeStdk4N6U0NyKOzF5OZwNTgY2BMyMigFdT\nSrsX91+QJEmNxLlIJFWSf4Wqrzx36se5SCRJUl2ZYEiSpMKZYEiSpMKZYEiSpMKZYEiSpMKZYEiS\npMKZYEiSpMKZYEiSpMKZYEiSpMKZYEiSpMI5VLikyqidqbGlpeX1GUudvVRrwqHC66e7ocJNMCRJ\nTcUEo36ci0SSJNWVCYYkSSqcCYYkSSqcCYYkSSqcCYYkSSqcCYYkSSqcCYYkSSqc42BIkhqeg7SV\nw4G2JElS4RxoS5Ik1ZUJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJ\nhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJ\nKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKpwJhiRJKlyvEoyIGBcR\n8yJifkQc38nr20XEHyPipYj4UvFhSpKkRjK0pw0iYghwOrAfsAS4MyIuTynNq9nsH8AU4OABiVKS\nJDWU3tRg7A48nFJqSym9ClwCHFS7QUrpqZTSn4HXBiBGSZLUYHqTYIwAFtU8fzxfJ0mS1CkbeUqS\npML12AYDWAyMrHm+Vb6uT6ZPn/7647FjxzJ27Ni+7kqSJNVRS0sLLS0tvdo2UkrdbxCxFvAQWSPP\nJ4A7gAkppbmdbDsNWJpS+k4X+0o9fZ4kSWoMEUFKKTp9rTcX/IgYB5xKdkvl3JTSyRFxJJBSSmdH\nxGbAn4D1gBXAUmDHlNLSDvsxwZAkqUn0O8EoMBATDEmSmkR3CYaNPCVJUuFMMCRJUuFMMCRJUuFM\nMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJ\nUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFM\nMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJ
UuFMMCRJ\nUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFM\nMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuF6lWBExLiImBcR8yPi+C62+X5EPBwR\nd0fEzsWGKUmSGkmPCUZEDAFOBw4AdgImRMT2HbY5EBiTUtoWOBL44QDEqopraWkpOwQNEMu2uVm+\nzavMsu1NDcbuwMMppbaU0qvAJcBBHbY5CLgQIKV0O7BBRGxWaKSqPH+kmpdl29ws3+ZV9QRjBLCo\n5vnj+brutlncyTaSJGmQsJGnJEkqXKSUut8g4l3A9JTSuPz5CUBKKX27ZpsfAjemlH6aP58H7JNS\n+muHfXX/YZIkqaGklKKz9UN78d47gW0iYhTwBHAoMKHDNlcAxwA/zROSZzsmF90FIUmSmkuPCUZK\naXlEfAG4luyWyrkppbkRcWT2cjo7pfSbiPhgRDwCLAM+PbBhS5KkKuvxFokkSdKa6ncjz4g4NyL+\nGhH31qx7R0TcGhF/iYg7ImLXmte+mg/INTciPlCzfpeIuDcfzOt7/Y1L/bcmZRsRoyLihYi4K1/O\nrHmPZVtBXZTv2yPijxFxT0RcHhFvqnnN726DWJOy9bvbWCJiq4j4XUQ8EBH3RcSx+fqNIuLaiHgo\nIq6JiA1q3lPOdzel1K8F2AvYGbi3Zt01wAfyxweSNQAF2BH4C9mtma2BR1hZi3I7sFv++DfAAf2N\nzaWuZTuqdrsO+7FsK7h0Ub53AHvljycDJ+aP/e420LKGZet3t4EWYHNg5/zxm4CHgO2BbwNfydcf\nD5ycPy7tu9vvGoyU0u+BZzqsXgG0Z08bko2LATAeuCSl9FpK6VHgYWD3iNgcWC+ldGe+3YXAwf2N\nTf2zhmULsFojXsu2uroo323z9QDXA4fkj/3uNpA1LFvwu9swUkpPppTuzh8vBeYCW5ENeHlBvtkF\nrCyr0r67AzUOxheBWRHxGPDfwFfz9V0NyDWCbACvdp0N5qVq6KpsAbbOq1hvjIi98nWWbWN5ICLG\n548/SfbDBX53m0FXZQt+dxtSRGxNVlN1G7BZyntvppSeBDbNNyvtuztQCcZRwH+mlEaSXZDOG6DP\nUf11VbZPACNTSrsAxwE/rr1/r4bxGeCYiLgTGA68UnI8Kk5XZet3twHlZXQp2e/xUqBjj43Se3AM\nVIJxRErpMoCU0qXAbvn6xcBbarbbKl/X1XpVT8ey3T1//EpK6Zn88V3AAuBtWLYNJaU0P6V0QEpp\nN7J5hxbkL/ndbXBdla3f3cYTEUPJkouLUkqX56v/2j4HWH7742/5+tK+u0UlGMGq9/AWR8Q+ABGx\nH9k9H8gG5Do0ItaOiNHANsAdeXXOcxGxe0QEcDhwOaqCnsp2fv54k8hm3iUi3kpWtgst28pbpXwj\n4s35v0OAr7FyZmS/u42nV2Xrd7chnQc8mFI6tWbdFWSNdwGOYGVZlffdLaBF64+BJcDLwGNkg2y9\nG/gTWcvVW4F31mz/VbJWrHPJeyPk6/8duI8sGTm17Ja6LmtWtsDHgPuBu/LXP2jZVnvponyPJWuV\nPg/4Zoft/e42yLImZet3t7EW4D3AcuDu/Hf4LmAcsDFZ492HyAbG3LDmPaV8dx1oS5IkFc7ZVCVJ\nUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuFMMCRJUuH+Pxn4t+a8Ev4NAAAA\nAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x11cd32da0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "timeaxis = []\n",
    "percentages = []\n",
    "minima = []\n",
    "maxima = []\n",
    "\n",
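    "# Each bin spans 15 years; the name 'decade' is kept only for continuity.\n",
    "for decade in range(1800, 2010, 15):\n",
    "    in_decade = all.loc[(all.firstpub >= decade) & (all.firstpub < (decade + 15)) & (all.isfic == 'fic'), : ]\n",
    "    masculine = sum(in_decade.gender == 'm')\n",
    "    feminine = sum(in_decade.gender == 'f')\n",
    "    # Share of women among fiction volumes coded 'm' or 'f'; other codes are ignored.\n",
    "    pct = feminine / (masculine + feminine)\n",
    "    # Plot each bin at the mean first-publication date of its volumes.\n",
    "    timeaxis.append(np.mean(in_decade.firstpub))\n",
    "    percentages.append(pct)\n",
    "    # Lower and upper bounds from bootstrap_ratio, which is defined elsewhere in this notebook.\n",
    "    minimum, maximum = bootstrap_ratio(feminine, masculine)\n",
    "    maxima.append(maximum)\n",
    "    minima.append(minimum)\n",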
    "\n",
    "plt.figure(figsize = (9, 6))\n",
    "plt.xlim(1800,2010)\n",
    "plt.ylim(0.0, 0.6)\n",
    "plt.title('Proportion of fiction by women')\n",
    "downward = np.array(percentages) - np.array(minima)\n",
    "upward = np.array(maxima) - np.array(percentages)\n",
    "plt.errorbar(timeaxis, percentages, yerr = [downward, upward], fmt='--o', ecolor = 'k', color = 'b')\n",
    "plt.savefig('figures/weightedsubset/pctwomen.png')\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
